In this article, the author analyzes the relationship between college majors and earnings potential, using data from the American Community Survey (ACS) and the U.S. Department of Education’s College Scorecard. The article emphasizes the importance of considering the cost of education in relation to potential earnings, as well as the long-term outlook for job growth in various fields. The article provides a link to the github repository for the data sources but does not have much code or any graphs (just some aggregation). I will be conducting some steps for data wrangling as well as producing some graphs that show helpful analysis about college majors 9 years back.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggthemes)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
Loading Data
recent_grads <- read.csv("recent-grads.csv")
head(recent_grads)
## Rank Major_code Major Total Men Women
## 1 1 2419 PETROLEUM ENGINEERING 2339 2057 282
## 2 2 2416 MINING AND MINERAL ENGINEERING 756 679 77
## 3 3 2415 METALLURGICAL ENGINEERING 856 725 131
## 4 4 2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING 1258 1123 135
## 5 5 2405 CHEMICAL ENGINEERING 32260 21239 11021
## 6 6 2418 NUCLEAR ENGINEERING 2573 2200 373
## Major_category ShareWomen Sample_size Employed Full_time Part_time
## 1 Engineering 0.1205643 36 1976 1849 270
## 2 Engineering 0.1018519 7 640 556 170
## 3 Engineering 0.1530374 3 648 558 133
## 4 Engineering 0.1073132 16 758 1069 150
## 5 Engineering 0.3416305 289 25694 23170 5180
## 6 Engineering 0.1449670 17 1857 2038 264
## Full_time_year_round Unemployed Unemployment_rate Median P25th P75th
## 1 1207 37 0.01838053 110000 95000 125000
## 2 388 85 0.11724138 75000 55000 90000
## 3 340 16 0.02409639 73000 50000 105000
## 4 692 40 0.05012531 70000 43000 80000
## 5 16697 1672 0.06109771 65000 50000 75000
## 6 1449 400 0.17722641 65000 50000 102000
## College_jobs Non_college_jobs Low_wage_jobs
## 1 1534 364 193
## 2 350 257 50
## 3 456 176 0
## 4 529 102 0
## 5 18314 4440 972
## 6 1142 657 244
ggplot(recent_grads, aes(Median)) +
geom_histogram(color ="black", fill = 'pink')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
> We can see that most of the majors make around 30k / year, I assume
this data is 9 years ago, therefore the salary is a bit low compared to
today. It can even be that this salary is right out of college too.
There are also some majors’ salary are very high too which is close to
100k or more.
I’m making a new df to for capitalize the Major column and following Median column order.
majors_processed <- recent_grads %>%
arrange(desc(Median)) %>%
mutate(Major = str_to_title(Major),
Major = fct_reorder(Major, Median))
head(majors_processed)
## Rank Major_code Major Total Men Women
## 1 1 2419 Petroleum Engineering 2339 2057 282
## 2 2 2416 Mining And Mineral Engineering 756 679 77
## 3 3 2415 Metallurgical Engineering 856 725 131
## 4 4 2417 Naval Architecture And Marine Engineering 1258 1123 135
## 5 5 2405 Chemical Engineering 32260 21239 11021
## 6 6 2418 Nuclear Engineering 2573 2200 373
## Major_category ShareWomen Sample_size Employed Full_time Part_time
## 1 Engineering 0.1205643 36 1976 1849 270
## 2 Engineering 0.1018519 7 640 556 170
## 3 Engineering 0.1530374 3 648 558 133
## 4 Engineering 0.1073132 16 758 1069 150
## 5 Engineering 0.3416305 289 25694 23170 5180
## 6 Engineering 0.1449670 17 1857 2038 264
## Full_time_year_round Unemployed Unemployment_rate Median P25th P75th
## 1 1207 37 0.01838053 110000 95000 125000
## 2 388 85 0.11724138 75000 55000 90000
## 3 340 16 0.02409639 73000 50000 105000
## 4 692 40 0.05012531 70000 43000 80000
## 5 16697 1672 0.06109771 65000 50000 75000
## 6 1449 400 0.17722641 65000 50000 102000
## College_jobs Non_college_jobs Low_wage_jobs
## 1 1534 364 193
## 2 350 257 50
## 3 456 176 0
## 4 529 102 0
## 5 18314 4440 972
## 6 1142 657 244
I want to create a new df to see the Total Observations of each Major_Category:
by_major_category <- majors_processed %>%
filter(!is.na(Total)) %>%
group_by(Major_category) %>%
summarize(Men = sum(Men),
Women = sum(Women),
Total = sum(Total),
College_jobs = sum(College_jobs),
Non_college_jobs = sum(Non_college_jobs),
MedianSalary = sum(Median * Sample_size) / sum(Sample_size)) %>%
mutate(ShareWomen = Women / Total)
head(by_major_category)
## # A tibble: 6 × 8
## Major_category Men Women Total Colle…¹ Non_c…² Media…³ Share…⁴
## <chr> <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 Agriculture & Natural Re… 40357 35263 7.56e4 18677 33217 35586. 0.466
## 2 Arts 134390 222740 3.57e5 94785 163720 32046. 0.624
## 3 Biology & Life Science 184919 268943 4.54e5 151233 127182 34379. 0.593
## 4 Business 667852 634524 1.30e6 148538 496570 40890. 0.487
## 5 Communications & Journal… 131921 260680 3.93e5 86556 172992 34738. 0.664
## 6 Computers & Mathematics 208725 90283 2.99e5 137859 74463 46993. 0.302
## # … with abbreviated variable names ¹College_jobs, ²Non_college_jobs,
## # ³MedianSalary, ⁴ShareWomen
Let’s see the correlation between variables within the dataframe we just created.
pairs(by_major_category[,2:8], lower.panel = NULL)
> Looks like there are no significant variables that are corelated
with each other.
Let’s see which Major_category makes the most in terms of salary
majors_processed %>%
group_by(Major_category) %>%
summarize(Median = median(Median)) %>%
mutate(Major_category = fct_reorder(Major_category, Median)) %>%
ggplot(aes(Major_category, Median)) +
geom_col(fill = "lightgreen") +
scale_y_continuous(labels = dollar_format()) +
coord_flip()
What are the most common majors? (Add a bit colors)
majors_processed %>%
mutate(Major = fct_reorder(Major, Total)) %>%
arrange(desc(Total)) %>%
head(20) %>%
ggplot(aes(Major, Total, fill = Major_category)) +
geom_col() +
coord_flip() +
scale_y_continuous(labels = comma_format()) +
labs(x = "", y = "Total number of graduates") +
theme(legend.position = "none")
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Major = fct_reorder(Major, Total)`.
## Caused by warning:
## ! `fct_reorder()` removing 1 missing value.
## ℹ Use `.na_rm = TRUE` to silence this message.
## ℹ Use `.na_rm = FALSE` to preserve NAs.
Highest Earning Major (in detail)?
majors_processed %>%
filter(Sample_size >= 100) %>%
head(20) %>%
ggplot(aes(Major, Median, color = Major_category)) +
geom_point() +
geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
expand_limits(y = 0) +
scale_y_continuous(labels = dollar_format()) +
coord_flip() +
labs(title = "What are the highest-earning majors?",
subtitle = "Top 20 majors with at least 100 graduates surveyed",
x = "",
y = "Median salary")
What are the lowest earning majors (in detail)?
majors_processed %>%
filter(Sample_size >= 100) %>%
tail(20) %>%
ggplot(aes(Major, Median, color = Major_category)) +
geom_point() +
geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
expand_limits(y = 0) +
coord_flip()
I added some color and make the graph interactive in terms of Median Salary and also order it make it easier to look and detect outliers.
interactive_plot <- ggplot(majors_processed, aes(Major_category, Median, fill = Major_category)) +
geom_boxplot() +
expand_limits(y = 0) +
coord_flip() +
theme(legend.position = "none")
plotly::ggplotly(interactive_plot)
Another interesting interactive plot that I also want to examine the proportion of genders for each major to see if there is something exciting.
gen_inter <- majors_processed %>%
arrange(desc(Total)) %>%
head(20) %>%
mutate(Major = fct_reorder(Major, Total)) %>%
gather(Gender, Number, Men, Women) %>%
ggplot(aes(Major, Number, fill = Gender)) +
geom_col() +
coord_flip()
plotly::ggplotly(gen_inter)
ggplot(data = by_major_category, aes(x = Major_category, y = College_jobs, fill = College_jobs)) + geom_bar(stat = "identity") +
labs(x = "Major", y = "Amount of Jobs Need a Degree (2012)") +
scale_y_continuous(labels = comma_format()) +
theme(legend.position = "none") +
coord_flip()
> Based on the bar chart, we can clearly sees that Education,
Engineering and Health industries requires a College Degree since it
requires a lot of expertise. I think it does not change until today
also.
Since I see there are some outliers, I want to see which major has high salary 9 years back.
head(recent_grads %>%
arrange(desc(Median)) %>%
select(Major, Major_category, Median))
## Major Major_category Median
## 1 PETROLEUM ENGINEERING Engineering 110000
## 2 MINING AND MINERAL ENGINEERING Engineering 75000
## 3 METALLURGICAL ENGINEERING Engineering 73000
## 4 NAVAL ARCHITECTURE AND MARINE ENGINEERING Engineering 70000
## 5 CHEMICAL ENGINEERING Engineering 65000
## 6 NUCLEAR ENGINEERING Engineering 65000
Turns out it is Petroleum Engineering that has outliers. But there might be no surprise that the major has a very high salary and can be much higher nowadays.
ggplot(majors_processed, aes(Sample_size, Median)) +
geom_point() +
geom_text(aes(label = Major), check_overlap = TRUE, vjust = 1, hjust = 1) +
scale_y_continuous()+
scale_x_log10()
Another aspect I want to explore is the Unemployment Rate for each Major Category. I think this is quite interesting to see 9 years ago which industry has a highest unemployment rate.
majors_processed %>%
group_by(Major_category) %>%
ggplot(aes(x = fct_reorder(Major_category, Unemployment_rate), y = Unemployment_rate)) +
geom_bar(stat = "identity", fill = "pink") +
coord_flip() +
labs(title = "Unemployment Rate by Major Category", x = "Major Category", y = "Unemployment Rate")
I got to learned Linear Regression in one of my classes, I would like to apply the knowledge to the predictor which is: “ShareWomen” and the outcome as “Median”.
majors_processed %>%
select(Major, Total, ShareWomen, Sample_size, Median) %>%
lm(Median ~ ShareWomen, data = ., weights = Sample_size) %>%
summary()
##
## Call:
## lm(formula = Median ~ ShareWomen, data = ., weights = Sample_size)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -260500 -61042 -13899 33262 865081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52073 1436 36.255 <2e-16 ***
## ShareWomen -23650 2403 -9.842 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123000 on 170 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.363, Adjusted R-squared: 0.3592
## F-statistic: 96.87 on 1 and 170 DF, p-value: < 2.2e-16
The analysis found that there is a statistically significant negative relationship between the proportion of female graduates and median earnings, meaning that as the proportion of female graduates in a major increases, the median earnings of graduates in that major decrease. Specifically, for each 1% increase in the proportion of female graduates, the median earnings decrease by an estimated $23,650, after adjusting for the sample size of each major. The adjusted R-squared value of 0.3592 indicates that about 36% of the variance in median earnings can be explained by the proportion of female graduates, after taking into account the sample size of each major. The residual standard error of $123,000 suggests that there is still a lot of unexplained variability in the data, meaning that other factors besides the proportion of female graduates are likely also influencing median earnings.
Overall, I think this dataset is very interesting to explore and wrangle since it is 9 years ago. There are a lot of interesting findings such as the unemployment rate 9 years ago which falls into Engineering industry, this is also the highest paid and most of them requires a College degree. I also applied Linear Regression model above to see the relationship between Median Salary with the amount of Female Graduates.
Reflection: I personally think this is an awesome activity to enhance and improve our R skills as well as data analysis skills. I also did learn new things throughout the course of exploring the dataset. What I would do different is that I would read and learn the dataset more coherently next time before dive into data wrangling and visualizing. Last but not least, I would choose appropriate color theme throughout the graphs in order to make the audience see comfortably.